---
title: "Replication Instructions"
output:
  html_document:
    df_print: paged
  pdf_document: default
vignette: |
  %\VignetteIndexEntry{Web scraping 101} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
---

<style type="text/css">

body{ /* Normal  */
      font-size: 15px;
  }
td {  /* Table  */
  font-size: 8px;
}
h1.title {
  font-size: 30px;
  color: DarkRed;
}
h1 { /* Header 1 */
  font-size: 22px;
  color: Black;
}
h2 { /* Header 2 */
    font-size: 18px;
  color: Black;
}
h3 { /* Header 3 */
  font-size: 14px;
  color: Black;
}
code.r{ /* Code block */
    font-size: 12px;
}
pre { /* Code block - determines code spacing between lines */
    font-size: 14px;
}
</style>


```{r, echo=FALSE}
knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
```

```
R version 4.0.3 (2020-10-10)
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0
```


## Overview

- This document introduces the data sets and code needed to replicate the results in **Taegyoon Kim. Violent Political Rhetoric on Twitter. 2022. *Political Science Research and Methods*.** The replication materials involve 7 figures (Figure 2-6, Figure A1-2) and 14 tables (Table 1-4, Table A1, Table A3-10). The code is written in both R and Python.

- Table A2 references the ICR scores reported in studies on aggressive online speech. These include [Theocharis et al. (2016)](https://onlinelibrary.wiley.com/doi/abs/10.1111/jcom.12259), [Munger (2021)](https://www.cambridge.org/core/journals/journal-of-experimental-political-science/article/abs/dont-me-experimentally-reducing-partisan-incivility-on-twitter/CA72D8773AC00916F5A551F80E6C06D3), [Wulczyn, Thain, and Dixon (2017)](https://dl.acm.org/doi/abs/10.1145/3038912.3052591), and [Cheng, Danescu-Niculescu-Mizil, and Leskovec (2015)](https://ojs.aaai.org/index.php/ICWSM/article/view/14583).

- The scripts are stored in the **script** folder (which appears once you unzip **kim_psrm_replication.zip**). The data sets needed to replicate the figures and tables are stored in the **data** folder. Each script also specifies the data sets and packages it requires.

- Note that there are objects called `path_data` and `path_output` in every script. They define the location of necessary data sets and the location of output figures and tables, respectively.

- Run each script in its entirety to generate and save all the relevant figures (as pdf) and tables (as tex or csv). To produce a single figure or table, run its entire code block (all the lines of code for that particular figure or table) at once. All output tables and figures are saved in the **output** folder (which is empty before you run the scripts).

- All the code and data sets needed to replicate the results are listed in the next section. As Twitter's terms of service restrict the redistribution of its content to third parties (see **content redistribution** in https://developer.twitter.com/en/developer-terms/agreement-and-policy), I share only the IDs of tweets for the training set for machine classification and for the violent tweets data set (see the last two sections).

- Make sure to compare the replicated figures and tables against the attached version of the paper (**kim_psrm_2022.pdf**), which is the most up-to-date version. In particular, it contains a replicable visualization (for the Fightin' Words analysis) and replicable classification performance (dealing with the issue of random sampling). It has also been proofread for typos.
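
As an illustration of the `path_data` and `path_output` objects described above, here is a minimal Python sketch of how they might be set and sanity-checked before running a script. The `project_root` location and the `describe_paths` helper are assumptions made for this example, not part of the replication scripts; only the object names and the **data** and **output** folder names come from this document.

```python
from pathlib import Path

# Hypothetical location of the unzipped archive; adjust to your machine.
project_root = Path("kim_psrm_replication")

# Same names as the objects defined in every script.
path_data = project_root / "data"      # where the input data sets live
path_output = project_root / "output"  # where figures and tables are written


def describe_paths(data_dir, output_dir):
    """Report whether both folders exist and whether the output folder is empty."""
    data_dir, output_dir = Path(data_dir), Path(output_dir)
    output_exists = output_dir.is_dir()
    return {
        "data_exists": data_dir.is_dir(),
        "output_exists": output_exists,
        "output_empty": output_exists and not any(output_dir.iterdir()),
    }
```

Calling `describe_paths(path_data, path_output)` before a run gives a quick check that the folders are where the scripts expect them.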


## List of Scripts

### Script 1. `kim_psrm_fig2_fig4_tbla6_tbla10.R`
- Data set: `df_fw_notext.csv`, `df_fw_dfm.rds`, `df_user.csv`
- Output figures: Figure 2, Figure 4
- Output tables: Table A6, Table A10

### Script 2. `kim_psrm_fig3_fig5a_fig5c_figa2_fig6c.py`
- Data set: `df_timeline.csv`, `df_ideology_violent.csv`, `df_ideology_nonviolent.csv`, `df_ideology_violent_without_trump.csv`, `df_ideology_nonviolent_without_trump.csv`, `df_distance_violent.csv`, `df_distance_nonviolent.csv`
- Output figures: Figure 3, Figure 5a, Figure 5c, Figure A2, Figure 6c

### Script 3. `kim_psrm_fig5b_fig5d_fig6a_fig6b.R`
- Data set: `df_ideology.csv`, `df_extremity.csv`, `df_homophily.csv`
- Output figures: Figure 5b, Figure 5d, Figure 6a, Figure 6b

### Script 4. `kim_psrm_figa1_tbla1.R`
- Data set: `df_prop.csv`
- Output figures: Figure A1
- Output tables: Table A1

### Script 5. `kim_psrm_tbl1-4_tbla7-9.R`
- Data set: `df_hashtag.csv`, `df_hashtag_weekly.rds`, `df_mention.csv` 
- Output tables: Table 1-4, Table A7-A9

### Script 6. `kim_psrm_tbla3.R`
- Data set: `df_icr.csv`
- Output table: Table A3

### Script 7. `kim_psrm_tbla4_tbla5.py`
- Data set: `df_cv_lr_count.csv`, `df_cv_lr_tfidf.csv`, `df_cv_lr_glove.csv`, `df_cv_rf_count.csv`, `df_cv_rf_tfidf.csv`, `df_cv_rf_glove.csv`, `df_cv_xgb_count.csv`, `df_cv_xgb_tfidf.csv`, `df_cv_xgb_glove.csv`, `df_cv_bert.csv`
- Output tables: Table A4, Table A5
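
Before running any script in the list above, it can help to confirm that its data sets are present in the **data** folder. The sketch below is one hedged way to do that in Python. The `REQUIRED_DATA` mapping and the `missing_data` helper are hypothetical names introduced for this example; the mapping covers only two of the scripts, with file names copied from the list, and can be extended in the same way.

```python
from pathlib import Path

# Data sets per script, copied from the list of scripts (two entries shown).
REQUIRED_DATA = {
    "kim_psrm_figa1_tbla1.R": ["df_prop.csv"],
    "kim_psrm_tbla3.R": ["df_icr.csv"],
}


def missing_data(path_data, required=REQUIRED_DATA):
    """Return {script: [missing file names]} for scripts lacking an input file."""
    data_dir = Path(path_data)
    missing = {}
    for script, files in required.items():
        absent = [name for name in files if not (data_dir / name).is_file()]
        if absent:
            missing[script] = absent
    return missing
```

An empty dictionary from `missing_data(path_data)` means every listed data set was found.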



## Machine Classifier

The machine classifier introduced in the paper is based on BERT (https://arxiv.org/abs/1810.04805). The training set (N = 10,097) can be found here: https://github.com/taegyoon-kim/violent_political_rhetoric_on_twitter/blob/master/violent_political_rhetoric_training.csv. It contains tweet IDs and their binary labels. To train the classifier and machine-classify your own tweets, use the fine-tuned BERT classifier notebook: https://github.com/taegyoon-kim/violent_political_rhetoric_on_twitter/blob/master/violent_political_rhetoric_classifier_BERT.ipynb. I recommend running the notebook on Google Colaboratory.

Alternatively, to build a new classifier from scratch (including manual labeling): <br/>
1) Retrieve tweets that mention a Twitter account (i.e., `@account_name`) via the Twitter APIs. <br/>
2) Generate a list of violent keywords computationally using the following script and data: https://github.com/taegyoon-kim/violent_political_rhetoric_on_twitter/blob/master/violent_political_rhetoric_violent_keywords.py. <br/>
3) Human-label a subset of the tweets gathered from the previous two steps based on the following coding scheme: https://github.com/taegyoon-kim/violent_political_rhetoric_on_twitter/blob/master/violent_political_rhetoric_manual_labeling.pdf. <br/>
4) Machine-classify all remaining tweets using a fine-tuned BERT classifier: https://github.com/taegyoon-kim/violent_political_rhetoric_on_twitter/blob/master/violent_political_rhetoric_classifier_BERT.ipynb. <br/>
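
To make the first three steps more concrete, here is a rough Python sketch of how a pool of candidate tweets for manual labeling could be filtered. It assumes that a candidate is a tweet that both mentions an account and contains at least one keyword. That rule, the `is_labeling_candidate` helper, and the example keywords are assumptions for illustration only; they are not the keyword list or the selection procedure used in the paper, which are defined in the linked script and coding scheme.

```python
import re

# Matches Twitter-style mentions such as "@account_name".
MENTION = re.compile(r"@\w+")


def is_labeling_candidate(text, keywords):
    """True if the tweet mentions an account and contains at least one keyword."""
    if not MENTION.search(text):
        return False
    # Drop the mentions so that account names cannot match a keyword.
    body = MENTION.sub(" ", text).lower()
    tokens = set(re.findall(r"[a-z']+", body))
    return any(keyword.lower() in tokens for keyword in keywords)


# Hypothetical keywords, for illustration only.
example_keywords = ["shoot", "hang"]
```

For instance, `is_labeling_candidate("@account_name someone should shoot him", example_keywords)` evaluates to `True`, while a tweet without a mention or without a keyword is skipped.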


## Violent Tweets Data Set

The IDs of the violent tweets used for the analysis are included in `df_violent_tweet_ids.csv`. You can use the Twitter APIs to retrieve a variety of information about a tweet from its ID (see https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/introduction). Note that, as the tweets included in the data set are highly likely to violate Twitter's rules (https://help.twitter.com/en/rules-and-policies/violent-threats-glorification), some of the tweets may already have been taken down or the related accounts may have been suspended.
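
The following Python sketch shows one way to prepare lookup requests from a list of tweet IDs. It only builds the request URLs and makes no network calls. It assumes the v2 tweet lookup endpoint (`https://api.twitter.com/2/tweets`) with its `ids` and `tweet.fields` query parameters and a cap of 100 IDs per request, as I understand the linked documentation; check the current documentation before relying on these details. The helper names `batch_ids` and `lookup_urls` are hypothetical, and reading the IDs from `df_violent_tweet_ids.csv` is left out because its column layout is not described here.

```python
from urllib.parse import urlencode

LOOKUP_ENDPOINT = "https://api.twitter.com/2/tweets"
BATCH_SIZE = 100  # assumed per-request cap on IDs for the lookup endpoint


def batch_ids(tweet_ids, size=BATCH_SIZE):
    """Split tweet IDs into lists of at most `size` IDs, kept as strings."""
    # Strings avoid precision loss for large numeric IDs.
    ids = [str(tweet_id) for tweet_id in tweet_ids]
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def lookup_urls(tweet_ids, fields=("created_at", "author_id")):
    """Build one lookup URL per batch of tweet IDs."""
    urls = []
    for batch in batch_ids(tweet_ids):
        query = urlencode({
            "ids": ",".join(batch),
            "tweet.fields": ",".join(fields),
        })
        urls.append(f"{LOOKUP_ENDPOINT}?{query}")
    return urls
```

Each URL would then be requested with your own API credentials. Tweets that have been taken down, or whose accounts have been suspended, will not be returned.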

